WalkIm: Compact image-based encoding for high-performance classification of biological sequences using simple tuning-free CNNs

The classification of biological sequences is an open issue for a variety of data sets, such as viral and metagenomics sequences. Many studies therefore utilize neural networks, the well-known tools in this field, and focus on designing customized network structures. However, few works focus on other influential factors, such as the input encoding method or the implementation technology, to address accuracy and efficiency issues in this area. In this work, we propose an image-based encoding method, called WalkIm, whose adoption, even in a simple neural network, provides competitive accuracy and superior efficiency compared to existing classification methods (e.g. VGDC, CASTOR, and DLM-CNN) for a variety of biological sequences. When WalkIm is used to classify various data sets (i.e. virus whole-genome data, metagenomics read data, and metabarcoding data), it achieves the same performance as the existing methods, with no need for parameter initialization or network architecture adjustment for each data set. It is worth noting that even for highly mutating data sets, such as Coronaviruses, it achieves almost 100% accuracy in classifying their various types. In addition, WalkIm achieves fast convergence during network training, as well as a reduction of network complexity. The WalkIm method therefore enables us to run the classifying neural networks on a normal desktop system in a short time. Moreover, we addressed the compatibility of the WalkIm encoding method with free-space optical processing technology. Taking advantage of an optical implementation of the convolutional layers, we showed that the training time can be reduced by up to 500 times. In addition to all the aforementioned advantages, this encoding method preserves the structure of the generated images under various sequence transformations, such as reverse complement, complement, and reverse.


Introduction
Classification, i.e. assigning input data to known classes of samples with similar features, has emerged as an essential problem in biological studies. As a popular application of classification in biology, it categorizes creatures into various classes with specific evolutionary levels [1]. Depending on the type of creatures, the classification can be adopted in various fields; from MLDSP [8]) eliminate spatial information, while multivector based methods ignore information of local patterns within the sequences (e.g. multivector [14]). Last but not least, k-mer based methods, such as MLDSP [8] and the FCGR-based method [15], require specific assumptions on the size of the k-mers and their distributions. All the aforementioned issues raised for vector-based methods restrict their applicability. For instance, k-mer frequency-based methods, such as [15], or multivector methods, such as [14], cannot address problems involving motif finding, local information discovery within the sequences, locating transcription factor binding sites, or structural variation discovery, especially in the presence of a high mutation rate within the input sequences. On the other hand, most of the issues discussed for feature-based methods arise for model-based methods as well. For instance, COMET [3], a model-based classification method for HIV-1 subtypes, builds a variable-order Markov model for all reference sequences and obtains the likelihood of a query's occurrence within each reference sequence. This method, similar to other methods in this category, necessitates selection of the best model order, adjustment of the window size, as well as the threshold value for recombination detection [5]. Although this method achieves good accuracy in classifying special taxa (HIV-1 subtypes), it is incompatible with other taxa (such as Influenza A) [3].
All the aforementioned challenges with various categorization approaches necessitate the development of genome classifier tools that are accurate, quick, and simple to use. In this regard, the Convolutional Neural Network (CNN), a class of deep neural networks, has been proposed to address various concerns about accuracy [16]. Indeed, mostly fed by visual inputs, CNNs are very successful in extracting essential features from input images in the presence of noise [6]. During the last decades, the popularity of CNNs has been rising as a result of the increased availability of computational resources, data sets, training algorithms, and simple implementation libraries [17]. Since CNNs are generally successful in image processing, bioinformatics scientists came up with the idea of visualizing biological sequences as images [18][19][20]. Specifically, sequence visualization methods can encode various features of the input sequence, produce a fixed-size output image regardless of the input sequence length, and generate a distinct signature from each biological sequence [18,21]. However, these achievements come at the cost of reduced system performance [16]. This has led to the widespread adoption of vector formats in CNNs, such as one-hot encoding, rather than image formats [5,22]. Nevertheless, the aforementioned advantages of visualizing biological sequences as 2D or 3D images still hold. It should also be noted that sequence visualization can be adopted in computational methods for encoding input sequences [15,23].
So far, we have discussed various benefits of adopting CNNs fed by input images for genome classification, but it should be noted that some serious drawbacks, especially in the case of large input data, limit their applicability [17]. Performing convolution operations at consecutive layers leads to high run time, power, and storage consumption, and so has other side effects, such as environmental impact [24,25]. Moreover, considering the exponential growth of biological data sets as a result of improved sequencing technology, these issues become more critical. Although the time spent training an ML model is often not counted within the run time of the classifier, its reduction has been targeted by many studies in the recent decade [5,6]. To resolve the computational complexity of the CNN architecture, in this paper, we propose to migrate the implementation technology from the electrical to the optical domain to considerably improve the run time and energy consumption of CNNs. Theoretically, we can achieve computation at the speed of light and save up to 90% of the energy, and thus reduce environmental degradation [26]. This solution is possible because convolution is easily implemented with two simple lenses in optics when the data are given as an image [26], and many works have implemented it both in free space [27][28][29] and on-chip [25,26]. In this way, by converting sequences to a standard image format, this solution can be used, and in addition to the very good feature extraction that a CNN provides, the best run time and energy consumption can be achieved compared to all other ML methods.
All the aforementioned advantages of sequence visualization motivate researchers to visualize biological sequences in the form of 2D or 3D images [15,30,31,32]. Theoretically, all kinds of K-mer histograms of sequences [6,8,30], one-hot representation, representation methods based on physicochemical properties [33], representation methods based on a combination vector of several types of information [34], and DNA-walk representation can be adopted to encode biological sequences as 2D images. Of course, it should be noted that methods like [33] that emphasize physicochemical properties are often developed for protein sequences, since the physicochemical properties of amino acids are much more diverse and therefore create a rich set of information. However, the generated images do not necessarily satisfy the standard image format required for their adoption in CNNs, which is a 2D array of color triplets whose values are in the range of 0 to 255. In addition, there are some flaws in genomic visualization techniques, as reported in Table 1 for some well-known visualization methods. As mentioned in this table, in [6], a one-dimensional matrix containing the K-mer frequencies feeds two types of networks, CNN and DBN, similar to the encoding scheme adopted in [8]. This matrix contains 4^K entries of decimal numbers, each representing the frequency of the corresponding K-mer. Although the proposed K-mer counting method [6] creates a vector of fixed and limited dimensions, its classification accuracy is significantly affected by the length of the substrings, i.e. the value of K. As a result, the vector dimensions can be reduced only to a limited extent while still guaranteeing adequate precision. As another challenging issue for this encoding, for small values of K, as is usual for most biological tools, the vector's entries are extremely diverse. Therefore, information loss occurs as the result of mapping the vector's entries to the limited range of [0, 255] assumed for common image formats. Finally, since this encoding scheme loses the spatial information of the sequences, it cannot be adopted in a variety of applications, such as motif finding.
To overcome the aforementioned issues, various methods, such as Spatial_K-mer [30], were proposed to encode the locations as well as the frequencies of the K-mers. For this purpose, the sequences are first split into chunks of constant length, and then the K-mer frequencies of each chunk are determined to preserve the local information of the sequence. However, extracting K-mer frequencies for each chunk increases the vector size compared to the traditional K-mer counting method, and the vector's length is not fixed but is determined by the input sequence length. Furthermore, similar to the usual K-mer frequency encoding methods, the vector's entries vary over a wide range, which does not match the common image formats.
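The size behaviour of the two K-mer encodings discussed above can be illustrated with a minimal Python sketch. The function names `kmer_counts` and `chunked_kmer_counts` are hypothetical, and the chunking shown here is only an assumed simplification of the idea, not the implementation of [6] or [30].

```python
from itertools import product


def kmer_counts(seq, k):
    """Count overlapping k-mers; the result always has 4**k entries."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0] * (4 ** k)
    for i in range(len(seq) - k + 1):
        pos = index.get(seq[i:i + k])  # windows with non-ACGT symbols are skipped
        if pos is not None:
            counts[pos] += 1
    return counts


def chunked_kmer_counts(seq, k, chunk_len):
    """Concatenate per-chunk k-mer counts; the length grows with len(seq)."""
    vector = []
    for start in range(0, len(seq), chunk_len):
        vector.extend(kmer_counts(seq[start:start + chunk_len], k))
    return vector


seq = "ACGTACGTAC"
global_vec = kmer_counts(seq, 2)                       # fixed length 4**2
local_vec = chunked_kmer_counts(seq, 2, chunk_len=5)   # two chunks of 4**2 entries
print(len(global_vec), len(local_vec))
```

In this sketch the global vector keeps a fixed length but carries no positional information, while the chunked vector keeps coarse positions at the price of a length that depends on the sequence.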
To preserve spatial information of the sequence, DNA-walk [32][33][34] has been proposed, which produces an image containing an observable and recognizable pattern. In general, in this method, each nucleotide is represented as a two-dimensional vector, although it can be extended to three-dimensional versions or higher. To encode the sequences, each letter of the sequence is substituted by the relevant vector, so that consecutive vectors form a path for the whole sequence. Although DNA-walk produces a meaningful image which can even be recognized by humans, it suffers from various drawbacks, as follows: a) the image size increases with the length of the sequence, b) data loss occurs as the result of overlapping subsequent vectors, depending on the directions of the corresponding nucleotides' vectors. Specifically, overlap scenarios happen when subsequent vectors are perpendicular to each other, as happens for the method presented in [32], although there would be no data loss if the vectors do not return, as proposed in [34]. On the other hand, by quantizing the vectors' length in the images, the DNA-walk presented in is unable to distinguish vectors with an angle difference of less than 45 degrees. Finally, considering sequence classification applications, several numerical encoding approaches, such as substituting nucleotides with integer values [5] or one-hot encoding [22], have been proposed. However, since the size of the resultant encoded data depends on the sequence length, these methods cannot produce proper images for an optical CNN architecture. In Table 1, the column "Size" indicates the vector size for each method depending on the string length or the K value (for k-mer based methods). The "Local inf." column shows whether the method keeps local data. The "S.I compatibility" column shows whether the encoding method is compatible with the Standard Image (S.I) format (3D matrix for RGB channels with finite integers in the range of 0 to 255).

PLOS ONE
To resolve most concerns mentioned in Table 1, this paper presents a novel image generation method to preserve key aspects of biological sequences. Designing an encoding approach based on DNA-walk, we can restrict output image sizes despite rising input sequence lengths and, most crucially, eliminate usage of prior knowledge or pre-processing of input sequences.
Our goal, based on what has been discussed, is to develop an encoding method that includes all of the following features:
• An image-based encoding strategy that takes advantage of CNNs' strong image-processing abilities.
• Images are generated in typical image formats.
• Images include the sequence's general and local information.
• Image size grows slowly in proportion to the sequence length.
More features are discussed in the following sections. The next step is to employ our new images in a very basic CNN with the goal of accurately classifying many different types of data sets in a generalized manner without any special settings for each data set. In reality, additional accuracy can be attained by fine-tuning CNN for each data set. In this study, we focus on DNA, but it can be generalized to RNA and Protein too.

Materials and methods
Designing deep learning methods involves two parts: a) the input data pattern and its encoding, and b) the neural network design, while an optimized co-design of the input coding method and the neural network affects output accuracy. Although most studies focused on optimized network design to achieve high accuracy, this comes at the cost of increased network complexity, as well as high runtimes of the learning and evaluation processes [6,8]. As discussed in this paper, the type of input data and its encoding approach can help alleviate these problems. However, it should be emphasized that the encoding method should not be overly complicated, since this would increase the pre-processing time. So in this paper, a novel encoder-classifier method, named WalkIm (DNA-walk based Image), is introduced to resolve the aforementioned issues. As shown in Fig 1, WalkIm consists of an encoding unit to perform input data processing, and a CNN unit to accomplish data classification. In the following, we explore the input data encoding and the CNN design of WalkIm in more detail.

Input data
Since we do not customize WalkIm for a specific data type, to evaluate its accuracy, we prepare two assessments by addressing three frequently referred usages of sequence classification methods at various levels of evolution: a) viral classification [5], b) bacterial classification based on metagenomics data [6], and c) metabarcoding classification [37]. The viral classification itself includes five tests to classify various types of Dengue, Hepatitis B, Hepatitis C, HIV-1, Influenza A, and corona. Except for corona, the corresponding data sets are collected as described in [5]. The corona data set is collected separately, with access details listed in the section "Data" of S1 File. It should be noted that the conditions for downloading data sets are given in [5], and as a result, the number of files available for each data set may vary based on the download time. Therefore, our data sets have different numbers of samples compared to those of [5]. Specifically, for the HIV-1 data set, with 37 categories mentioned in [5], we have imported 36 categories, since there was no sample for one of the categories mentioned in [5]. Bacteria taxonomy consists of four tests to classify the same samples, generated using Amplicon (AMP) next-generation sequencing technology, into four levels of evolution (i.e. class, order, family, and genus); these samples are accessible from [6]. The last data set is a barcoding data set consisting of cytochrome c oxidase subunit I (COI) DNA barcode sequences to be classified into taxonomic kingdoms [37]. Specifications of these data sets are summarized in Table 2, while all access information for these data sets is provided in section "Data" of S1 File.

WalkIm encoder unit
As the encoding unit of WalkIm, we modify the DNA-walk encoding method to eliminate special processing, as well as the need for prior knowledge of the input data and its distribution. This feature enables WalkIm to target any type of input sequence, either biological or non-biological text-based data. Moreover, the outputs of the WalkIm encoding unit represent signatures of the input sequences that are distinguishable even by human observers. In this manner, by taking advantage of the input signature, it avoids the need for a complex CNN structure to perform classification. This property is illustrated in Fig 2, which presents several sample images from several corona categories.
As the key advantages of DNA-walk representation, we can mention that it preserves locational information and depicts the nucleotide distribution within various fragments of the sequence, as well as within the whole sequence. However, it comes in several variations. The earlier model of DNA-walk uses four main directions (i.e. west, east, north, and south) to represent the nucleotides, and hence a DNA (or RNA) sequence is plotted by consecutive unit vectors in these directions [32,35,38]. As its main drawback, overlapping and crossing of the curves representing DNA segments cause information loss. As a modified version of the DNA-walk representation, [36] proposes the adoption of four vectors with direction angles in the range of 0 to 90 degrees to avoid path return. Despite this improvement, all variations of DNA-walk encoding methods lead to output images whose size depends on the length and distribution of the nucleotides. Nonetheless, almost all variations of DNA-walk encoding methods achieve good accuracy in comparing and classifying biological sequences [32,39].
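The two drawbacks of the four-direction DNA-walk, path overlap and a sequence-dependent image size, can be reproduced with a short Python sketch. The assignment of nucleotides to the four directions in `DIRECTIONS` is a hypothetical choice made for illustration; the cited works may use a different one.

```python
# Hypothetical direction assignment as (row step, column step).
DIRECTIONS = {"A": (0, -1), "T": (0, 1), "C": (-1, 0), "G": (1, 0)}


def dna_walk(seq):
    """Return the lattice points visited by the walk, starting at the origin."""
    r, c = 0, 0
    path = [(r, c)]
    for base in seq:
        dr, dc = DIRECTIONS[base]
        r, c = r + dr, c + dc
        path.append((r, c))
    return path


def bounding_box(path):
    """Height and width of the smallest image that contains the whole path."""
    rows = [p[0] for p in path]
    cols = [p[1] for p in path]
    return (max(rows) - min(rows) + 1, max(cols) - min(cols) + 1)


path_overlap = dna_walk("ATATAT")    # opposite steps retrace the same two points
path_long = dna_walk("TTTTTTTTTT")   # a homopolymer run stretches the image
print(len(path_overlap), len(set(path_overlap)))
print(bounding_box(path_long))
```

In a binary drawing of the first walk, seven visited positions collapse onto two pixels, which is the information loss mentioned above, while the second walk shows how the required image size follows the sequence content.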
As discussed, various versions of the DNA-walk encoding method are designed to address its main drawbacks, such as information loss. However, even its recent versions still face some challenging issues. Specifically, Fig 3 shows the evolution of DNA-walk encoding methods and their main challenges. For example, various studies aim to avoid probable overlaps of DNA pathways, the main drawback of DNA-walk representation, by preventing loop formation along the pathway [40,41]. But this improvement comes at the cost of increased image size. On the other hand, various studies deal with DNA-walk encoding as a variable-size visual representation, since the size of its output image depends on the input sequence pattern. To address these challenges, and many others, the WalkIm encoding method is proposed to generate fixed-size output images, with a fixed range of pixel values, which can be fed to any image-based CNN classifier. Finally, although many studies target accuracy improvement of DNA-walk encoding, only a few of them [32,39] discuss the variety of its applications.
Table 2. Summarized information of utilized data sets; * one class of this data set has over 300 thousand samples, and we take a random subsample of it with 2000 samples to balance the data set, as described in [37].
WalkIm's encoding unit is designed to generate the standard output image format (i.e. a discrete-space, fixed-size image with pixel values in the range of 0 to 255). It has two versions: 3 layers for the RGB color channels and 1 layer for the grayscale format. Every pixel of the image, except for those on the sides, has 8 neighbor pixels along its sides and corners, as shown in Fig 3, and so has 8 directions to move forward. The steps of WalkIm's encoding unit are as follows:
1. Consider a square of size M×M.
3. Assign each square corner to one of the four possible nucleotides (i.e. A, C, G, and T).
4. Parse the input DNA sequence (from 3'-end to 5'-end). On reading each nucleotide, move toward the corresponding direction by one pixel, and increase the pixel value by a constant value. For the grayscale format, this step is done in a single-layer 2D matrix. For the three-layer format, three of the four letters are each assigned to one of the three layers of the RGB image, and the fourth letter is assigned to two of the three layers.
5. If the next move hits the image sides, return to point O and go to step 4. Otherwise, continue to step 6.

End of encoding.
For more clarity, the above procedure is schematically depicted in Fig 4. It is worth noting that the constant value, added to the matrix cells in each step, can vary. For high image resolution and contrasts, it is recommended to set this value to more than 10 (instead of one). In this case study, this value is set to 255. Although the produced values in the matrix may exceed 255, the maximum pixel value, Python automatically converts these values to the range of 0 to 255 when saving the image content, and hence, it preserves the information during the encoding procedure.
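The grayscale variant of the procedure can be sketched in plain Python as follows. Several details are assumptions made for illustration only: point O is taken as the central pixel, the corner assignment in `CORNER_STEP` follows the layout suggested by the description of Fig 5D, a move that would leave the square restarts the walk at O and is then applied from there, and `wrap_to_uint8` shows modular wrap-around as just one possible 8-bit conversion (whether a given image library wraps, clips, or rescales is not specified in the text).

```python
# Assumed corner assignment (row step, column step): A top-left, C top-right,
# G bottom-left, T bottom-right, so complementary bases share a diagonal.
CORNER_STEP = {"A": (-1, -1), "C": (-1, 1), "G": (1, -1), "T": (1, 1)}


def walkim_encode(seq, size=65, increment=255):
    """Grayscale WalkIm-style walk on a size x size grid (size >= 3)."""
    image = [[0] * size for _ in range(size)]
    origin = size // 2              # assumption: point O is the central pixel
    r = c = origin
    for base in seq.upper():
        step = CORNER_STEP.get(base)
        if step is None:            # assumption: ambiguous symbols are skipped
            continue
        nr, nc = r + step[0], c + step[1]
        if not (0 <= nr < size and 0 <= nc < size):
            # The move would leave the square: restart at O, then move.
            nr, nc = origin + step[0], origin + step[1]
        r, c = nr, nc
        image[r][c] += increment    # repeated visits accumulate
    return image


def wrap_to_uint8(image):
    """One possible 8-bit conversion: wrap-around modulo 256 (an assumption)."""
    return [[value % 256 for value in row] for row in image]


img = walkim_encode("ACGTTTGA", size=9, increment=255)
img8 = wrap_to_uint8(img)
```

The sequence is read in the order in which it is given, so a caller who wants the 3'-to-5' direction described above would pass the string in that order.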
Main points of the proposed encoding method are described as follows:
1. Considering non-binary pixel values, overlaps of the pathway can be mostly traced, and hence, information loss is avoided, unlike previous versions of DNA-walk [32,35]. This property is illustrated in Fig 5A.
2. In the case that each pair of complementary nucleotides (i.e. (C and G), (A and T)) is assigned to two corners on the same diagonal, the encoded sequences, read either from 3'-end to 5'-end or from 5'-end to 3'-end, have similar shapes and are symmetric with respect to the central pixel of the square, as shown in Fig 5B. This also happens for the reverse complement sequence, as shown in Fig 5C.
3. The overall distribution of nucleotides throughout the input sequence and their statistical features can be visually analyzed by studying the pathway directions, as shown in Fig 5D. For example, in this sample figure, the shape is placed in the top half of the image, which means that the numbers of A and C are larger than those of G and T. Moreover, there is an equal number of pixels in the left and right halves and a reciprocating path is created between the two halves, which means that both halves of the sequence have almost the same numbers of A and G, compared to those of C and T.
4. Encoding an input sequence within a square of a specific size can be performed in two ways: a) the proposed encoding method generates the coded image within a square of the desired size, or b) it produces a larger image and afterward downscales it to an image of the desired size. In the latter case, the pathway hits the sides of the square less often, while the overall shape of the pathway is preserved, as shown in Fig 5E. In this manner, we can increase classification accuracy by providing more room for the various moves.
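The symmetry described in point 2 can be checked numerically for the complement transformation. The sketch below is an illustration under stated assumptions: complementary bases sit on the same diagonal (`CORNER_STEP`), the walk starts at the central pixel of a square with an odd side length, and a move that would leave the square restarts from the centre. The names `encode` and `rotate_180` are hypothetical helpers, and the reverse and reverse-complement cases are not checked here.

```python
CORNER_STEP = {"A": (-1, -1), "C": (-1, 1), "G": (1, -1), "T": (1, 1)}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def encode(seq, size=33):
    """Visit-count image of the walk; size is assumed to be odd."""
    image = [[0] * size for _ in range(size)]
    origin = size // 2
    r = c = origin
    for base in seq:
        dr, dc = CORNER_STEP[base]
        nr, nc = r + dr, c + dc
        if not (0 <= nr < size and 0 <= nc < size):
            nr, nc = origin + dr, origin + dc   # restart from the centre
        r, c = nr, nc
        image[r][c] += 1
    return image


def rotate_180(image):
    """Point reflection of the image about its central pixel."""
    return [row[::-1] for row in image[::-1]]


seq = "ACGGTACCTTAGGACCA" * 5
comp = "".join(COMPLEMENT[b] for b in seq)
same_shape = encode(comp) == rotate_180(encode(seq))
print(same_shape)
```

Under these assumptions every step of the complemented sequence is the negative of the original step, so the two images are point reflections of each other even when the walk restarts at the sides.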

straightforwardly [5]. This complexity arises from the different data dimensionality of genomes, compared to traditional image formats. Specifically, one-dimensional genomes cannot be easily fed to convolutional networks operating on two-dimensional images. On the other hand, although various studies propose restructured CNNs [5,6,17] to be compatible with genomic data, they reduce the data processing capability of these networks.
Fig 5. Some key features of WalkIm encoder; a) usually, input sequences can be rebuilt (i.e. decoded) from the encoded image and the generated shape is traceable, b) the shape generated from the reversed sequence (S_r) is symmetric about the x axis, compared to its original shape, c) the shape generated from the reverse-complemented sequence (S_rc) is symmetric about the x axis, compared to its original shape, d) the generated shape contains some statistical information. https://doi.org/10.1371/journal.pone.0267106.g005
In this manner, we propose a novel encoding method to facilitate the adoption of powerful convolutional architectures for genomics data processing. Specifically, we encode each sequence as an image and feed it to a CNN. Moreover, instead of using complex convolutional architectures, we adopt a simple and shallow convolutional neural network, named the WalkIm classifier unit, for four reasons:
• Powerful encoding method: To emphasize the capability of the WalkIm encoding method and its impact on the classification accuracy, we use only a few convolutional layers, unlike popular DNNs.
• Facilitating optical implementation: Taking advantage of the optical implementation of convolutional layers [26,27], the proposed neural network can be easily implemented in the optical domain. Specifically, the well-known 4f optical correlator is a common architecture for performing the convolution operation in free-space optics [27,47]. This system is based on the notable Fourier transforming properties of converging lenses. The structure of a 4f correlator system consists of an input plane, a first lens, a Fourier plane, a second lens, and an output plane [47]. Detailed explanations on this configuration are provided in S1 File section "Optical CNN setup."
• Facilitating PC-based implementation: Due to the large size of the encoded images and the massive data sets, a shallow convolutional neural network is proposed to enable network implementation on normal desktop computers.
• Eliminating parameter initialization: As a key advantage of WalkIm, it does not require initialization of network parameters for varying (or new) data sets.
Considering all the aforementioned explanations, we provide the WalkIm classifier unit in two versions, a simple one and a slightly deeper one, in order to investigate the effect of network depth on the classifier's accuracy. It should be noted that almost all recently developed neural networks for sequence classification [5,6] are customized with specific parameter values for each data type and species to achieve acceptable classification accuracy. However, as a key advantage of our proposed encoding method, we do not impose such network settings. To clarify the power of the WalkIm encoder, we employ only the most basic networks to attain similar accuracy, compared to the alternative tools. Of course, better results can be obtained by configuring a classifier unit for each category. The architectures of the two versions of the WalkIm network proposed for genome classification are presented in Fig 6. These are convolutional classifier models whose input images are produced by the WalkIm encoder. It is worth noting that, unlike most existing methods, such as VGDC and [48], the size of the input images in the WalkIm network does not depend on the genome length. Specifically, without loss of generality, we resize the encoded images to 256 by 256. In this manner, our input images have dimensions of 256 × 256 × 3 (for RGB format) and 256 × 256 × 1 (for grayscale format), while these sizes can be increased or decreased, with no impact on the classification accuracy.

CNN characterization
The proposed CNN architecture produces a vector P of size 1 × N, where N is the total number of classes (i.e., viral subtypes) for the given problem, and entry P_i ∈ P (1 ≤ i ≤ N) represents the probability that a given genome, encoded and fed as the input image to the CNN, belongs to the i-th class.
As depicted in Fig 6, there only exist two parts dedicated to a) the convolutional layers, performing feature extraction, and b) the fully-connected layers, predicting the genome subtype, based on the features extracted by the convolutional part. The detailed explanations are provided as follows.
a. Convolutional layers: Input images are fed to three consecutive and similar convolutional blocks (shown as Conv Block1, Conv Block2, and Conv Block3), each containing one 2D convolutional layer in the simpler network (CNN simple in Fig 6A) and two 2D convolutional layers in the more complex network (CNN complex in Fig 6B). Each convolutional layer is followed by a ReLU (Rectified Linear Unit) activation layer to improve training performance [49].
In the proposed CNN architecture, each convolutional layer convolves the corresponding input image with a set of learnable filters whose coefficients are learned through the network training process. In WalkIm network, filters of size 3 × 3 are assumed, while the number of filters is increased by a factor of two from each convolutional block to the next. Specifically, it varies from 8 (for CNN simple in Fig 6A) and 64 (for CNN complex in Fig 6B) for the first convolutional block to 32 and 256 for the third one. The rectified linear activation layers (ReLU), following the convolutional layers, introduce non-linearity to reduce over-fitting. Specifically, ReLU reduces the vanishing gradient problem, and avoids back propagation errors, while it is much faster, compared to sigmoid activation function.
As the last layer of convolutional block, 2D pooling layers follow ReLUs and perform maxpooling operation with the pooling filter of size 2 × 2 and the stride of 2 for both simpler and complex networks. Although pooling operation reduces the input size, it extracts characteristic genomic features and propagates them to the dense layers. Finally, the output of the last max-pooling layer is converted into a 1D feature vector to be used by the classifier part of the network, as follows.
b. Classifier layers: As the classifier part of the network, WalkIm uses two dense layers with a decreasing number of neurons, from 128 neurons in the first dense layer to N neurons in the second one, where N is the number of subtypes within each data set. The first dense layer, followed by a ReLU layer and a dropout layer, feeds the last layer, which implements the softmax activation function. Finally, WalkIm produces the probabilities of a given genome sequence belonging to each class, and the genome is classified as the subtype with the highest probability value.
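To make the size of the described networks concrete, the feature-map shapes and the number of trainable parameters can be tallied with a few lines of Python. This is a back-of-the-envelope sketch, not the authors' implementation: `describe_network` is a hypothetical helper, and it assumes 'same' padding with unit stride in the convolutions, bias terms in every layer, and the same filter count for both convolutions of a block in the deeper network.

```python
def describe_network(filters, convs_per_block, num_classes,
                     height=256, width=256, channels=3, dense_units=128):
    """Return per-block output shapes, flattened size, and parameter count."""
    shapes = []
    params = 0
    h, w, c = height, width, channels
    for f in filters:
        for _ in range(convs_per_block):
            params += (3 * 3 * c + 1) * f   # 3 x 3 kernels plus one bias each
            c = f
        h, w = h // 2, w // 2               # 2 x 2 max-pooling with stride 2
        shapes.append((h, w, c))
    flat = h * w * c                        # 1D feature vector fed to dense part
    params += (flat + 1) * dense_units      # first dense layer
    params += (dense_units + 1) * num_classes  # softmax output layer
    return shapes, flat, params


# Filter counts taken from the text: 8 to 32 (simple) and 64 to 256 (complex).
simple = describe_network((8, 16, 32), convs_per_block=1, num_classes=4)
deeper = describe_network((64, 128, 256), convs_per_block=2, num_classes=4)
print(simple)
print(deeper[0], deeper[1])
```

Under these assumptions most parameters of the simple network sit in the first dense layer, which suggests that the input image size, rather than the number of convolutional filters, dominates the model size.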

Training parameters
To evaluate the classification accuracy and runtime of the WalkIm network for various data sets, five-fold cross-validation for the viral [5] and metabarcoding [37] data sets, and ten-fold cross-validation for the metagenomics [6] data sets are performed. In each experiment, the network is trained for a maximum of 30 epochs, unless the early stopping condition is fulfilled, in which case the training is stopped after several successive epochs with no training improvement. According to our simulation studies, the training usually converges at about 15 epochs. The Adam optimizer [50] with a learning rate of 0.001 is adopted to minimize the categorical cross-entropy loss function, and the mean squared error is utilized to measure the performance of the model. The batch size is set to 64, except for data sets with a large number of training samples, where a batch size of 256 is used. Finally, it should be noted that the aforementioned hyperparameter values are determined in a trial-and-error manner, balancing training time against training performance.
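The evaluation protocol can be outlined with two small Python helpers, shown here as a sketch under stated assumptions. The names `k_fold_indices` and `epochs_to_run` are hypothetical, the folds are shuffled but not stratified, and the patience of 3 epochs is an illustrative value, since the text only says "several successive epochs" and does not name the monitored quantity.

```python
import random


def k_fold_indices(num_samples, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def epochs_to_run(losses, max_epochs=30, patience=3):
    """Number of epochs executed given per-epoch losses and early stopping."""
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(losses[:max_epochs], start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
        if stale >= patience:
            return epoch
    return min(len(losses), max_epochs)


splits = list(k_fold_indices(103, k=5))
stopped_at = epochs_to_run([5.0, 4.0, 3.0, 3.0, 3.0, 3.0, 2.0])
capped_at = epochs_to_run([1.0 / (i + 1) for i in range(50)])
print(len(splits), stopped_at, capped_at)
```

Each of the k networks would be trained on one `train` list and scored on the matching `test` list, with the reported metrics averaged over the folds.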
It is worth noting that, as a key advantage of the WalkIm classifier, the values of the CNN parameters, e.g. the filter size w, do not depend on the length of the input genomes and can be constant for all data sets. The size of the CNN input vector (i.e. n) is equal to the product of the image's dimensions, while the size of the output vector (i.e. N), generated by the CNN, is equal to the number of virus subtypes to be predicted.

System specification
The proposed CNN is implemented in Python 3.

Metrics for comparison
The performance of a classifier is generally measured by popular metrics that combine four basic quantities, i.e. TP (True Positive), TN (True Negative), FP (False Positive), and FN (False Negative). TP is the number of cases correctly identified as members of a class. FP is the number of cases incorrectly identified as members of a class. TN is the number of cases correctly identified as non-members of a class. Finally, FN is the number of cases incorrectly identified as non-members of a class. We use five well-established metrics based on these four basic quantities [5]. Specifically, these five metrics, i.e. sensitivity (Se), specificity (Sp), precision (PREC), accuracy (ACC), and F1-score (F1), are defined by Eqs 2 to 5 respectively.
In more detail, Se indicates the percentage of the members of a class that are correctly identified as members of that class. Sp shows the percentage of the non-members of a class that are correctly identified as non-members. PREC indicates the percentage of the items assigned to a class that actually belong to it. ACC indicates the percentage of correct decisions made in total (whether class members or non-members are correctly identified). Finally, the F1-score is the harmonic mean of the precision and sensitivity values, and it is advisable when a balance between precision and sensitivity is required and many actual negatives exist. In addition to these metrics, we also present the confusion matrices obtained from each test, in order to analyze WalkIm performance and the features of the data sets, in Section "Confusion matrices" of S1 File.
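As an illustration of how these five metrics follow from TP, TN, FP, and FN, the sketch below derives them per class from a multi-class confusion matrix using the standard one-vs-rest definitions. The function name `per_class_metrics` and the convention that rows are true classes and columns are predicted classes are assumptions made for this example.

```python
def per_class_metrics(confusion, cls):
    """One-vs-rest metrics for the class with index `cls`.

    ASSUMED layout: confusion[i][j] is the number of samples whose true
    class is i and whose predicted class is j.
    """
    total = sum(sum(row) for row in confusion)
    tp = confusion[cls][cls]
    fn = sum(confusion[cls]) - tp                    # members that were missed
    fp = sum(row[cls] for row in confusion) - tp     # non-members assigned to cls
    tn = total - tp - fn - fp

    def ratio(num, den):
        # Guard against empty classes or classes that are never predicted.
        return num / den if den else 0.0

    se = ratio(tp, tp + fn)               # sensitivity
    sp = ratio(tn, tn + fp)               # specificity
    prec = ratio(tp, tp + fp)             # precision
    acc = ratio(tp + tn, total)           # accuracy
    f1 = ratio(2 * prec * se, prec + se)  # harmonic mean of PREC and Se
    return {"Se": se, "Sp": sp, "PREC": prec, "ACC": acc, "F1": f1}
```

Averaging the per-class values over all classes gives one possible summary for a whole data set; whether the reference papers use this or another averaging scheme is not stated here.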
As a comprehensive simulation study, the WalkIm method is investigated and analyzed from several aspects. Specifically, we have created the following comparative simulation scenarios:
1. First, we use CNN simple to compare the grayscale and RGB modes of the created images in terms of classification accuracy. Afterwards, taking the more accurate image mode from this analysis, CNN complex is employed to investigate the accuracy of the WalkIm encoding method for various data sets in more detail. In this manner, the impact of image type and network depth on classification performance can be analyzed in depth. As the key advantage of the proposed encoding method, these CNNs are adopted with no specific initial settings to analyze the performance of WalkIm encoding. They are fed with data sets of three types of biological data, i.e. virus classification data [5], metagenomics data [6], and metabarcoding data [37], with variations in the number of samples, categories, and string lengths.
2. We estimate the training runtime of the optical CNN architecture adopting the WalkIm method for each data set, and compare it to the values published in the reference papers.

Performance comparison
As discussed above, we adopt three types of data from the three most relevant papers [5,6,37] to analyze the encoding capability of WalkIm when feeding the neural networks, and, for a fair comparison, similar comparison scenarios are adopted. We first evaluate both grayscale and RGB formats with CNN simple, as indicated in the previous section; the comparison metrics are shown in Tables 3 to 5. In the next step, depending on which image format works better with the simplest network (i.e. CNN simple), we analyze it in more detail with a deeper network (CNN complex). These results are also given in Tables 3 to 5 for each data set, and the corresponding discussions are presented below. Moreover, confusion matrices are reported in the appendix's "Confusion matrices" section.

Viruses
In Table 3, we compare the WalkIm results for viral data sets with three tools: a) COMET (a Markov-based approach customized for HIV and Hepatitis C viruses) [51], b) CASTOR (an RFLP-based web tool) [3], and c) VGDC (a viral genome classification tool based on ASCII encoding of sequences and a CNN) [5]. As mentioned in the previous section, the results for the corona data set are reported separately.
According to Table 3, increasing the number of categories reduces the classification accuracy in all approaches. It may appear that all formats of WalkIm are less accurate than the other approaches, albeit by a modest margin. However, the CNNs used in WalkIm are not customized for any data set, whereas the networks in the other approaches are finely tuned. So, although the difference is not significant, the WalkIm classifier would be expected to achieve better results if it were customized as well. In particular, for data sets like influenza A, accuracy improvement is readily achievable by neural network customization. The influenza A genome is made up of eight single strands, as found in each class of the data sets utilized in this study. As a result, any imbalance among the items in these classes, as well as the range of their encoded shapes, makes classification difficult, considering that these sequences are shorter and have a higher mutation rate than the sequences of the other data sets. Consequently, the influenza A genomes are more difficult to categorize than the alternative data sets, and establishing certain hyperparameters for the network, as addressed in many related studies [5,6], can facilitate the classification task.
Moreover, it should be noted that CNN simple achieves higher accuracy with the grayscale input format than with the RGB format. On the other hand, since the RGB format scored better on average across all data sets (i.e. viruses [5], metabarcoding [37], and metagenomics [6]), we adopt it for evaluating the virus data sets with CNN complex. The results show that the RGB format fed to CNN complex leads to higher accuracies than the other two test modes (RGB and grayscale images fed to CNN simple). Finally, it should be emphasized that the differences in classification accuracy obtained by adopting WalkIm in different networks are so small that they can be compensated for with minor adjustments, such as increasing the sample size of each class or the test data. Since the grayscale format requires less data, it can be claimed that it produces an acceptable output.

Table 4. Accuracies of various classification algorithms fed by the metabarcoding data set (all methods, except WalkIm, use 4-mer encoding). Since [37] only reports the accuracy of five classification methods and the confusion matrix of DNN, we computed the performance metrics of DNN by means of its confusion matrix. Character "-" means that the corresponding measure is not reported by the reference article.

Table 5. Classification performance measures achieved by various metagenomics classifiers for four evolutionary levels (i.e. class, order, family, and genus). Character "-" means that the corresponding measure is not reported by the reference article.

WalkIm images, on the other hand, provide key advantages over other approaches, particularly VGDC [5] as a CNN-based counterpart, as follows:

PLOS ONE
1. Increased scalability: The length of the input sequences has no linear effect on the size of the input image. In contrast, since [5] employs integer numbers to encode each nucleotide of the input sequence, increasing the length of the input sequence directly affects the network input dimensions.
2. Decreased runtime: For the network fed by WalkIm images, convergence is achieved after six or seven epochs, whereas usually 1000 epochs are required for VGDC [5] (although in some cases results are achieved in 200 epochs). In this regard, runtime can be considerably reduced. To illustrate this improvement, the accuracy convergence diagrams, in terms of the number of epochs, for the training and validation sets of the Dengue data set (as an example of viral data sets) are shown in Fig 7A.
3. Decreased input data volume: With the ability to scale WalkIm images, the amount of CNN input data can be reduced. For example, a sequence of 11,000 characters can be encoded in a 32 by 32 matrix with 1024 entries. This advantage is more evident in data sets with long sequences.
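The input-volume example in the list above can be checked with a few lines of arithmetic. The sketch below is only illustrative: the helper names are hypothetical, and it assumes that a per-nucleotide encoding produces one network input entry per character, while a WalkIm image has a fixed number of entries regardless of sequence length.

```python
def image_entries(side, channels=1):
    """Number of input entries of a square image (channels=1 for grayscale)."""
    return side * side * channels


def reduction_factor(seq_len, side, channels=1):
    """Ratio of per-nucleotide input size to fixed image input size.

    ASSUMPTION: the per-nucleotide encoding uses one entry per character.
    """
    return seq_len / image_entries(side, channels)


# Example from the text: 11,000 characters versus a 32 by 32 matrix.
entries = image_entries(32)           # 1024 entries
factor = reduction_factor(11000, 32)  # roughly a 10.7-fold reduction
```

For an RGB image of the same side length (assuming three channels), the entry count triples, so the reduction is correspondingly smaller.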

Metabarcoding
To evaluate the capability of WalkIm to encode and classify the metabarcoding data set, we compare it against [37], which encodes input data using a k-mer based method and reports the best setting (4-mers) as the basis for choosing a proper classification algorithm. Afterwards, [37] evaluates five classification algorithms (i.e. DNN, SVM, K Nearest Neighbors, Random Forest, and XGBoost) and declares DNN to be the best one. Although it is not common practice, in [37] a small and unbalanced subset of data is chosen from outside the main data set as the test data; nevertheless, their simulation results are given in Table 4 alongside the results of our method. As shown in Table 4, although no customization is performed for the WalkIm network to classify metabarcoding data, it obtains higher accuracy than DNN, while the other metrics are also extremely close. Since the RGB format has better results in most data sets, we evaluated this format with CNN complex, which surprisingly leads to an accuracy similar to that of CNN simple fed by the grayscale format. Thus, for this type of data set, CNN simple with RGB encoding can achieve a good result and no more complex CNN is needed. Finally, we would like to emphasize that our network converges quickly during the training phase (as shown in Fig 7B), but due to the lack of a similar result for DNN, we cannot report a comparison in this respect.

Metagenomics
For examining the metagenomics data sets, 16S short-read data are obtained using the amplicon (AMP) next-generation sequencing technique, which only considers some specified 16S hypervariable regions. The AMP data set is quite targeted, in the sense that it comprises the majority of the information content. As this data set was created by [6], we compare our simulation results against those reported in [6], as shown in Table 5.
Like other k-mer based methods, [6] examines different k-mer sizes (i.e. 3 to 7) as well as three classification algorithms (i.e. CNN, DBN and RDP), and so its best achieved result (i.e. 7-mer with CNN) is chosen for our comparative study. We would like to mention that, since [6] focuses on the evolutionary level of the genus, all performance metrics are reported for this level, while for the other levels some metrics are missing. In this regard, we can compare it with our method in more detail at the genus level, but we cannot accurately compare WalkIm to DLM-CNN at the other levels. Genus-level classification is especially important because of its challenges: a high number of classes and significant similarities among samples of different classes.
As shown in Table 5, the categorization becomes increasingly difficult for both approaches as the number of categories increases. However, as the number of categories increases, all versions of WalkIm achieve superior results over the DLM-CNN method without requiring any particular customization for this data set. Since the RGB format has better results in most data sets, we used CNN complex to evaluate RGB encoding without any additional adjustments, and the results were as expected: a significant improvement in all metrics. By providing more information than grayscale encoding and deepening the network for metagenomic data sets, the RGB format of WalkIm encoding is expected to produce significantly better outcomes compared to its counterparts [5,6,37]. WalkIm not only improves accuracy and performance, but also reduces the size of the input data and speeds up the training process.
It should be noted that no convergence results are reported for DLM-CNN with which we could compare the convergence speed of the training and validation processes for the metagenomics data sets. However, for runtime comparison, we obtained the convergence diagram versus the number of epochs for this data set using the WalkIm method for sequence encoding, as shown in Fig 7C for the AMP class as an example. These results indicate that the convergence speed is considerably improved, similar to the results achieved for the viruses and metabarcoding data sets in Fig 7A and 7B, respectively.

Computation time comparison
As discussed before, optical technologies can considerably speed up CNN architectures. For a detailed runtime analysis of the optical CNN, we formulate the computational latency and mathematically estimate the runtime of the optical CNN utilized for the implementation of WalkIm. As shown in S1 Fig in S1 File, considering the conventional optical setup implementing a CNN, optical beams pass through each layer of the CNN, and hence each layer adds a certain amount of latency to their path [27]. As a result, the total latency (i.e. the time it takes for data to enter and exit a CNN structure) can be calculated from the number of layers within the CNN and the propagation latency of each layer. In this manner, Eq 6 and Eq 7 formulate the latency of WalkIm's CNN architectures.
where T input is the time it takes to feed the input image to the network; T conv, T ReLU, and T MP represent the propagation latencies through each optical convolution layer, optical ReLU layer, and optical max pooling layer, respectively; T camera is the time it takes to capture the network's output image by the CCD camera, as the end point of the optical setup; and finally T transferData represents the transfer delay of the cable interconnecting the CCD camera to the computer system. The coefficients in Eq 6 and Eq 7 represent the number of corresponding optical layers within the optical setup. Section "Speed analysis and estimation of optical WalkIm CNN" in S1 File provides more information on this topic.
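The structure of this latency model can be sketched in Python under the assumption that the total latency is simply the sum of the fixed terms plus each per-layer latency multiplied by its layer count. This additive form is inferred from the description above and is not a reproduction of Eq 6 or Eq 7; the function name, argument names, and layer counts are hypothetical.

```python
def optical_latency(t_input, t_conv, t_relu, t_mp, t_camera, t_transfer,
                    n_conv, n_relu, n_mp):
    """Total input-to-output latency of an optical CNN (ASSUMED additive model).

    t_*  -- latency of one stage, all in the same time unit
    n_*  -- number of convolution, ReLU, and max pooling layers
    """
    per_layer = n_conv * t_conv + n_relu * t_relu + n_mp * t_mp
    return t_input + per_layer + t_camera + t_transfer
```

In such a model, adding layers changes the total only through the per-layer terms, so if those propagation delays are small relative to the input, camera, and transfer terms, a deeper network has almost the same latency as a shallow one.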
Considering conventional values for the optical latency parameters, as listed in Table 6, we can estimate the runtime of the proposed optical structure for both CNNs (i.e. CNN simple and CNN complex) as 0.45 ms, taking into account input-to-output data propagation through the WalkIm optical CNNs. The two optical structures show no time difference because the propagation delays of the extra optical layers included in the CNN complex architecture are negligible compared to the CNN simple architecture. In this manner, the training time is computed simply by multiplying the number of input samples in each training set by the calculated latency. Since 80% (for five-fold assessments) or 90% (for ten-fold assessments) of each data set is usually considered as the training set, the corresponding training time of the optical network can be easily calculated for various architectures, as shown in Table 7.
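The estimate described above amounts to a single multiplication, which the following sketch makes explicit. The 0.45 ms latency is taken from the text; the function name and the data set size of 10,000 samples used in the example are hypothetical and do not correspond to any entry of Table 7.

```python
LATENCY_S = 0.45e-3  # per-sample optical latency stated in the text (0.45 ms)


def optical_training_time(n_samples, n_folds, latency_s=LATENCY_S):
    """Estimated training time in seconds for one fold.

    The training set holds (n_folds - 1) / n_folds of the data, i.e. 80%
    for five-fold and 90% for ten-fold cross-validation.
    """
    n_train = round(n_samples * (n_folds - 1) / n_folds)
    return n_train * latency_s


# HYPOTHETICAL data set of 10,000 samples:
five_fold = optical_training_time(10_000, 5)   # 8,000 samples, about 3.6 s
ten_fold = optical_training_time(10_000, 10)   # 9,000 samples, about 4.05 s
```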
According to Table 7, the runtime of DNN metabarcoding is not reported in [37]; the authors only confirm that runtime and memory utilization can be a serious concern during the training process. For this reason, it is not possible to compare the runtime of the WalkIm method for the metabarcoding data set. On the other hand, since the runtime of the training phase for one fold of the metagenomics data set is reported in [6], we also provide the runtime of the training phase for the optical and electrical implementations of WalkIm for one fold. For a fair comparison of the VGDC runtime with that of WalkIm, we should note that VGDC reports the execution time of each epoch [5]. So, we multiply this time by the number of epochs required for training each data set to obtain the training runtime of a fold. As shown in Table 7, training runtimes for VGDC are reported as ranges between minimum and maximum values, because while the maximum number of training epochs for VGDC is 1000, it has been claimed that some data sets had an early stop at about 200 epochs [5]. As can be seen in Table 7, for the Hepatitis B (1), Hepatitis C, HIV (1), HIV (2), metagenomics family, and metagenomics order data sets, the runtime of the electrical implementation of WalkIm's CNN complex exceeds the maximum runtime reported for VGDC. However, when comparing the runtimes of any two methods, the corresponding hardware specifications should be considered as well. In this comparative study, VGDC takes advantage of a more powerful hardware implementation compared to WalkIm. The detailed hardware specifications for implementing VGDC and WalkIm are listed in Table 8. Moreover, it should be noted that while VGDC runs on a GPU, the electrical version of WalkIm runs on a CPU, and GPUs can generally reduce the runtime by at least 4 to 5 times [55]. In this manner, we can conclude that, on comparable hardware, even the electrical implementation of WalkIm's CNN complex would be substantially faster than VGDC. The most crucial point is that, in these data sets, the electrical implementation of CNN complex performed worse, whereas the electrical implementation of CNN simple, which performed better, is still faster even with these hardware limitations. The optical implementation of WalkIm is much faster than all electrical counterparts, with a speed-up by a factor of 60 to more than 550.
Similar to the comparative study of WalkIm and VGDC for the virus data sets, we next compare the training runtimes of WalkIm and DLM-CNN for the metagenomics data set. As shown in Table 7, as the number of categories increases in the family and genus data sets, the speed of WalkIm decreases compared to DLM-CNN. However, it should be emphasized that in [6] the execution runtime of DLM-CNN is reported for a cluster of 24 processing nodes (each including a single CPU and 48 GPUs), which is tens of times more powerful than the desktop computer on which WalkIm is executed. Therefore, we can conclude that, assuming similar hardware specifications, even the electrical implementation of WalkIm considerably outperforms DLM-CNN.

Table 7. Runtime for training networks in three methods: VGDC (viral whole-genome classifier), DLM-CNN metagenomics (metagenomics classifier), and WalkIm (general-purpose sequence classifier with electrical and optical implementation modes). DNN metabarcoding [37], which classifies metabarcoding data, did not report the corresponding runtime. Character "-" means that the corresponding measure is not reported by the reference article.

Finally, when it comes to WalkIm's execution time, it is important to note that the size of the input image influences the training speed. Initially, we implemented the encoding with an image size of 256 × 256 pixels. However, we scaled these images down to smaller sizes as far as possible without reducing accuracy. As a result, we can reduce the input size, increasing the training speed while preserving the accuracy. In addition to the speed enhancement, we demonstrated that WalkIm encodes information in a way that is not affected by scaling. Detailed information on data set scaling can be found in the section "Encoding details" of S1 File.

Conclusion
WalkIm, our proposed encoding method, focuses on the image representation of biological sequences and their usage in Convolutional Neural Networks (CNNs). It is efficient enough to be run even on simple desktop systems, and it is also compatible with free-space optical technology, empowering CNN implementation for big data processing. WalkIm encoding, as a novel extension of DNA-walk encoding, offers various advantages, such as statistical interpretability of the nucleotide distribution, as well as similarity of the encoded normal, reversed, and reverse-complemented sequences. Although WalkIm, as a universal method, can be used to classify any sequences based on their DNA and RNA strings, in this paper we evaluate it by classifying virus sequences (e.g. Coronaviruses, Dengue, HIV, Hepatitis B and C, and Influenza A), metagenomics data, and metabarcoding data. In this study, WalkIm was able to compete with the state-of-the-art methods of each field (VGDC [5], COMET [51] and CASTOR [3] for virus subtyping, DLM-CNN for metagenomics data [6], and [37] for metabarcoding data) in terms of accuracy and training speed without imposing any network adjustments for a specific data set. Although the tuning-free property of WalkIm facilitates its usage for classifying various data sets with no initialization phase, with proper adjustment of the network parameters for each data set WalkIm can be expected to outperform other methods as well. Moreover, while maintaining accuracy compared to alternative methods, WalkIm can improve the training speed on desktop systems by a factor of 1.5 to 1500 for various data sets. We have also shown that, by taking advantage of free-space optical technology for WalkIm implementation, we can improve training speed by more than 400 times compared to its electrical implementation.
It is worth noting that for complex neural networks and large data sets, running WalkIm on a desktop achieves up to 26 times higher speed than alternative methods, like DLM-CNN. Finally, we compared WalkIm with some of the existing fast and accurate methods, such as CASTOR for the classification of viruses, where WalkIm reached similar accuracy for various data sets, while CASTOR completed training of 250,000 samples after a few days. Despite all these advantages, WalkIm also faces some challenges. Although we have evaluated the image size and scale for all examined data sets, these parameters must be investigated for other data sets as well. To be able to employ any type of data with the specified parameters, future work needs to address the relationship between the image dimensions for WalkIm encoding and the length and type of the sequences.